Using GeDi on the T cell dataset (GSE130842)
The data illustrated in this document is an RNA-seq dataset, available at the Gene Expression Omnibus under the accession code GSE130842.
The data represents a mouse data set of 32 different samples across different tissues and conditions. The data is part of a manuscript to analyse different tissue regulatory T cell populations (Delacher et al. 2020) - the manuscript is available on [Pubmed] (https://pubmed.ncbi.nlm.nih.gov/31924477/).
We load the packages required to perform all the analytic steps presented in this document.
library("DESeq2")
library("topGO")
library("pcaExplorer")
library("ideal")
library("GeneTonic")
library("apeglm")
library("dplyr")
library("org.Mm.eg.db")
library("GeDi")
library("visNetwork")
In this example we analyse data available on the Gene Expression Omnibus under accession number GSE130842. From the available data, we downloaded the GSE130842_Count_table_Delacher_et_al_2019.xlsx Excel file, which is also available in the Github repository to this document.
We will preprocess the data for its use in Gedi
according to the workflow described in the package vignette of
Gedi. The package vignette is available through using
the command browseVignette(GeDi) after successfully installing the package.
In the first step of the preprocessing, we will generate a DESeqDataset
(Love, Huber, and Anders 2014) from the count table available in the Excel file. We will further
preprocess this object, called dds_tregs, to prepare the data for its use in
GeDi. We will also read in some metadata that we have set up for the dataset.
This file will also be available in the repository
as well as the final generated dds_tregs.
# We read in the data form the Excel file
count_df <- readxl::read_excel("GSE130842_Count_table_Delacher_et_al_2019.xlsx")
# We transform the data into a matrix and set the gene ids as the rownames of
# the final matrix
count_matrix <- as.matrix(count_df[, -1])
rownames(count_matrix) <- count_df$ID
# We read in the metadata
coldata <- readxl::read_excel("GSE130842_metadata.xlsx")
coldata
## # A tibble: 56 × 5
## ID_GEO group group2 rep_nr tissue
## <chr> <chr> <chr> <dbl> <chr>
## 1 Klrg1-Il1rl1-Treg_BM_R1 Klrg1-Il1rl1-Treg_BM Klrg1-Il1… 1 bonem…
## 2 Klrg1-Il1rl1-Treg_BM_R2 Klrg1-Il1rl1-Treg_BM Klrg1-Il1… 2 bonem…
## 3 Klrg1-Il1rl1-Treg_BM_R3 Klrg1-Il1rl1-Treg_BM Klrg1-Il1… 3 bonem…
## 4 Klrg1-Il1rl1-Treg_BM_R4 Klrg1-Il1rl1-Treg_BM Klrg1-Il1… 4 bonem…
## 5 Klrg1-Il1rl1-Treg_Spleen_R1 Klrg1-Il1rl1-Treg_Spleen Klrg1-Il1… 1 spleen
## 6 Klrg1-Il1rl1-Treg_Spleen_R2 Klrg1-Il1rl1-Treg_Spleen Klrg1-Il1… 2 spleen
## 7 Klrg1-Il1rl1-Treg_Spleen_R3 Klrg1-Il1rl1-Treg_Spleen Klrg1-Il1… 3 spleen
## 8 Klrg1-Il1rl1-Treg_Spleen_R4 Klrg1-Il1rl1-Treg_Spleen Klrg1-Il1… 4 spleen
## 9 tisTregST2_BM_R1 tisTregST2_BM tisTregST2 1 bonem…
## 10 tisTregST2_BM_R2 tisTregST2_BM tisTregST2 2 bonem…
## # ℹ 46 more rows
# We build up the DESeqDataset object from the count data and the metadata
# As design we will choose the group of the data which indicates the tissue
# of origin as well as t5he type of Tcells in the sample
dds_tregs <- DESeqDataSetFromMatrix(
countData = count_matrix,
colData = coldata,
design = ~group
)
# We transform the columnnames of the dds_tregs to include the replicate number
# of each sample
colnames(dds_tregs) <- paste0(coldata$group, "_r", coldata$rep_nr)
# Lastly we can have a look at our DESeqDataset object
dds_tregs
## class: DESeqDataSet
## dim: 52550 56
## metadata(1): version
## assays(1): counts
## rownames(52550): ENSMUSG00000000001 ENSMUSG00000000003 ...
## ENSMUSG00000114967 ENSMUSG00000114968
## rowData names(0):
## colnames(56): Klrg1-Il1rl1-Treg_BM_r1 Klrg1-Il1rl1-Treg_BM_r2 ...
## Treg_IL2_r3 Treg_IL2_r4
## colData names(5): ID_GEO group group2 rep_nr tissue
Now we can see that we have set up a DESeqDataset object on all the available
samples. However, in this analysis, we will only focus on a subset of samples as
shown in Figure 2 of the original manuscript (Delacher et al. 2020). Hence, we will
subset our dds_tregs object to the groups used in the figure.
dds_tregs <- dds_tregs[, dds_tregs$group %in%
c("KLRGminusNFIL3minusTreg" ,
"KLRGminusNFIL3plusTreg",
"KLRGplusNFIL3plusTreg",
"tisTregST2_BM",
"tisTregST2_Fat",
"tisTregST2_Liver",
"tisTregST2_Lung",
"tisTregST2_Skin"
)]
dds_tregs$group <- droplevels(dds_tregs$group)
design(dds_tregs) <- ~group
In a first analysis step, we do an exploratory data analysis as described in (Ludt et al. 2022) using the pcaExplorer package (Marini and Binder 2019).
We will first apply a variance-stabilizing transformation before we plot a PCA and a sample-to-sample distance heatmap.
# Apply the vst transformation
vst_tregs <- vst(dds_tregs)
# Plot the PCA
pcaExplorer::pcaplot(vst_tregs,
intgroup = "group",
ntop = 1000,
title = "PCA plot - top 1000 most variable genes",
ellipse = FALSE,
text_labels = FALSE
)
# Plot a sampl-to-sample distance heatmap
pheatmap::pheatmap(as.matrix(dist(t(assay(vst_tregs)))))
After we have done some exploratory data analysis, we can proceed with the differential expression analysis. We use the DESeq2 package (Love, Huber, and Anders 2014) for this. The False Discovery Rate is set to 5%.
Before, we run the DESeq2 analysis, we first match the gene ids to gene names using the pcaExplorer package (Marini and Binder 2019). With this we can add the gene names to the results, which are usually better known than gene ids. The gene names will also be later used as “Genes” column in the input for GeDi.
# Create an annodation data frame mapping gene ids to gene names.
anno_df <- pcaExplorer::get_annotation_orgdb(dds_tregs,"org.Mm.eg.db","ENSEMBL")
# Assign a new column SWYMBOL to the dds_tregs object, which will be later used
# as "Genes" column in the input for GeDi
rowData(dds_tregs)$SYMBOL <- anno_df$gene_name[match(rownames(dds_tregs),
anno_df$gene_id)]
# Set the false discovery rate to 5%
FDR <- 0.05
# Perform the differential gene expression analysis
dds_tregs <- DESeq(dds_tregs)
After we performed the analysis, we extract the results for the condition
KLRGplusNFIL3plusTreg vs KLRGminusNFIL3minusTreg as we only want to compare
these two groups. Afterwards we print a summary overview of the previously
extracted results.
We also use the ideal package (Marini, Linke, and Binder 2020) to plot an MA-plot of the results.
In a last step, we add the gene symbols to the resulting DataFrame which will
later serve as our “Genes” column in the input data to GeDi.
# Extract differentially expressed genes
# Perform contrast analysis comparing "KLRGplusNFIL3plusTreg" group to "KLRGminusNFIL3minusTreg" group
# Set a log2 fold change threshold of 1 and a significance level (alpha) of 0.05
res_tregs <- results(dds_tregs,
contrast = c("group", "KLRGplusNFIL3plusTreg", "KLRGminusNFIL3minusTreg"),
lfcThreshold = 1, alpha = 0.05
)
# Print a summary overview of the results
summary(res_tregs)
##
## out of 42631 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 1.00 (up) : 683, 1.6%
## LFC < -1.00 (down) : 140, 0.33%
## outliers [1] : 551, 1.3%
## low counts [2] : 13054, 31%
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
# Plot an MA-plot of the results
ideal::plot_ma(res_tregs,
ylim = c(-5, 5),
title = "MAplot - KLRGplusNFIL3plusTreg vs KLRGminusNFIL3minusTreg")
# Add gene symbols to the results in a column "SYMBOL"
res_tregs$SYMBOL <- rowData(dds_tregs)$SYMBOL
Following the differential expression analysis, we perform a functional enrichment
analysis using the topGO package (Alexa and Rahnenfuhrer 2024). Before the analysis,
we first determine the set of background genes to be used, which in our case will
be the set of expressed genes in the data. We also transform the results of our
DE analysis to fit the format expectation of the topGOtable function from the
pcaExplorer package (Marini and Binder 2019).
# Determine the set of background genes as all genes expressed in the dataset
geneUniverseExpr <- rowData(dds_tregs)$SYMBOL[rowSums(counts(dds_tregs)) > 0]
# Extract gene symbols from the DESeq2 results object where FDR is below 0.05
# The function deseqresult2df is used to convert the DESeq2 results to a
# dataframe format
# FDR is set to 0.05 to filter significant results
de_symbols <- deseqresult2df(res_tregs, FDR = 0.05)$SYMBOL
# Perform Gene Ontology enrichment analysis using the topGOtable function from
# the "pcaExplorer" package
topGO_tregs <- topGOtable(
DEgenes = de_symbols,
BGgenes = geneUniverseExpr,
ontology = "BP",
geneID = "symbol",
addGeneToTerms = TRUE,
mapping = "org.Mm.eg.db",
topTablerows = 500
)
After we have performed the functional enrichment analysis, the data is almost
ready to be used with GeDi. However, in its current
state the data is not yet in the correct format expected by GeDi.
GeDi expects the data to have at least two columns,
one named “Genesets” containing some form of geneset identifiers and one named
“Genes” containing a list of genes belonging to the genesets. While this is not
strictly necessary to use GeDi on the data, it
facilitates the use of the app as the app can be used straight away instead of
having to wait for the data to be reformated. The correct data format is however
necessary, if you want to use the data as a parameter as in GeDi(topGO_tregs).
Nevertheless, we want to show you here, how to adapt the data from the topGO analysis to fit the data format requirements of GeDi. For this we simply have to rename the ‘GO.ID’ and the ‘genes’ column of the results as these two columns already contain the input data in the correct format.
# Rename columns in the topGO_tregs dataframe
# Change the column name "GO.ID" to "Genesets"
names(topGO_tregs)[names(topGO_tregs) == "GO.ID"] <- "Genesets"
# Change the column name "genes" to "Genes"
names(topGO_tregs)[names(topGO_tregs) == "genes"] <- "Genes"
After we have renamed the columns, the data is now ready to be used in GeDi.
GeDi on the datasetNow we can start to explore our data using GeDi. For this you can either follow the chunks in this document to prepare the data or we can load the prepared object provided with the repository of this document.
tregs_example <- readRDS("usecase_tregs_example.RDS")
Once we have loaded the data, we can start the app and interactively explore the data set using:
GeDi(genesets = tregs_example)
Once the app is started, you can have interactive guidance of the user interface and its features by usind the introductory tours of each panel of the app.
GeDi’s functions in analysis reportsThe functionality of GeDi can be used also as standalone functions, to be called for example in existing analysis reports in RMarkdown, or R scripts.
In the following chunks, we show how it is possible to call some of the functions on the dataset presented in this document.
# First we extract a representation of all genes in the data
genes <- GeDi::getGenes(topGO_tregs)
# Next we calculate one distance score matrix for the data
mm_score <- GeDi::getMeetMinMatrix(genesets = genes)
rownames(mm_score) <- colnames(mm_score) <- topGO_tregs$Genesets
# We can use several plotting functions of GeDi to plot the data
GeDi::distanceHeatmap(mm_score)
GeDi::distanceDendro(mm_score)
adj <- GeDi::getAdjacencyMatrix(mm_score, 0.3)
rownames(adj) <- colnames(adj) <- topGO_tregs$Genesets
g <- GeDi::buildGraph(adj, topGO_tregs, topGO_tregs$Genesets)
visNetwork::visIgraph(g)
After we have calculated distance scores, we can cluster the data and visualize the results in several different ways.
# First we are clustering the data
clustering <- GeDi::clustering(mm_score, threshold = 0.3, cluster_method = "markov")
# Next we can visualize the results of the clustering
g <- GeDi::buildClusterGraph(clustering,
topGO_tregs,
topGO_tregs$Genesets,
color_by = "Annotated")
visNetwork::visIgraph(g)
g <- GeDi::getBipartiteGraph(clustering, topGO_tregs$Genesets, genes)
visNetwork::visIgraph(g) %>%
visIgraphLayout(layout = "layout_as_bipartite")
GeDi::enrichmentWordcloud(topGO_tregs)
sessionInfo()
## R Under development (unstable) (2024-03-12 r86109)
## Platform: x86_64-apple-darwin20
## Running under: macOS Ventura 13.2.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/Berlin
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] visNetwork_2.1.2 GeDi_1.0.0
## [3] org.Mm.eg.db_3.19.1 dplyr_1.1.4
## [5] apeglm_1.26.0 GeneTonic_2.8.0
## [7] ideal_1.27.0 pcaExplorer_2.30.0
## [9] topGO_2.56.0 SparseM_1.81
## [11] GO.db_3.19.1 AnnotationDbi_1.66.0
## [13] graph_1.82.0 DESeq2_1.44.0
## [15] SummarizedExperiment_1.34.0 Biobase_2.64.0
## [17] MatrixGenerics_1.16.0 matrixStats_1.3.0
## [19] GenomicRanges_1.56.0 GenomeInfoDb_1.40.0
## [21] IRanges_2.38.0 S4Vectors_0.42.0
## [23] BiocGenerics_0.50.0 knitr_1.46
##
## loaded via a namespace (and not attached):
## [1] STRINGdb_2.16.0 fs_1.6.4 bitops_1.0-7
## [4] fontawesome_0.5.2 httr_1.4.7 webshot_0.5.5
## [7] RColorBrewer_1.1-3 doParallel_1.0.17 numDeriv_2016.8-1.1
## [10] dynamicTreeCut_1.63-1 Rgraphviz_2.48.0 tippy_0.1.0
## [13] tools_4.4.0 utf8_1.2.4 R6_2.5.1
## [16] DT_0.33 lazyeval_0.2.2 mgcv_1.9-1
## [19] GetoptLong_1.0.5 withr_3.0.0 prettyunits_1.2.0
## [22] gridExtra_2.3 fdrtool_1.2.17 cli_3.6.2
## [25] Cairo_1.6-2 TSP_1.2-4 labeling_0.4.3
## [28] slam_0.1-50 sass_0.4.9 bs4Dash_2.3.3
## [31] tm_0.7-13 mvtnorm_1.2-4 genefilter_1.86.0
## [34] ggridges_0.5.6 goseq_1.55.0 yulab.utils_0.1.4
## [37] Rsamtools_2.20.0 rentrez_1.2.3 AnnotationForge_1.46.0
## [40] plotrix_3.8-4 bbmle_1.0.25.1 limma_3.60.0
## [43] readxl_1.4.3 rstudioapi_0.16.0 RSQLite_2.3.6
## [46] generics_0.1.3 GOstats_2.70.0 shape_1.4.6.1
## [49] BiocIO_1.14.0 gtools_3.9.5 crosstalk_1.2.1
## [52] dendextend_1.17.1 Matrix_1.7-0 fansi_1.0.6
## [55] abind_1.4-5 lifecycle_1.0.4 yaml_2.3.8
## [58] gplots_3.1.3.1 SparseArray_1.4.1 BiocFileCache_2.12.0
## [61] grid_4.4.0 blob_1.2.4 promises_1.3.0
## [64] bdsmatrix_1.3-7 crayon_1.5.2 shinydashboard_0.7.2
## [67] miniUI_0.1.1.1 lattice_0.22-6 ComplexUpset_1.3.3
## [70] GenomicFeatures_1.56.0 annotate_1.82.0 KEGGREST_1.44.0
## [73] magick_2.8.3 pillar_1.9.0 ComplexHeatmap_2.20.0
## [76] rjson_0.2.21 codetools_0.2-20 glue_1.7.0
## [79] data.table_1.15.4 vctrs_0.6.5 png_0.1-8
## [82] cellranger_1.1.0 gtable_0.3.5 gsubfn_0.7
## [85] assertthat_0.2.1 emdbook_1.3.13 cachem_1.0.8
## [88] xfun_0.43 S4Arrays_1.4.0 mime_0.12
## [91] coda_0.19-4.1 wordcloud2_0.2.1 survival_3.6-4
## [94] pheatmap_1.0.12 seriation_1.5.5 iterators_1.0.14
## [97] statmod_1.5.0 nlme_3.1-164 Category_2.70.0
## [100] bit64_4.0.5 threejs_0.3.3 progress_1.2.3
## [103] filelock_1.0.3 UpSetR_1.4.0 bslib_0.7.0
## [106] KernSmooth_2.23-22 colorspace_2.1-0 DBI_1.2.2
## [109] tidyselect_1.2.1 chron_2.3-61 bit_4.0.5
## [112] compiler_4.4.0 curl_5.2.1 httr2_1.0.1
## [115] BiocNeighbors_1.22.0 BiasedUrn_2.0.11 expm_0.999-9
## [118] NLP_0.2-1 xml2_1.3.6 ggdendro_0.2.0
## [121] DelayedArray_0.30.0 plotly_4.10.4 colourpicker_1.3.0
## [124] bookdown_0.39 rtracklayer_1.64.0 scales_1.3.0
## [127] caTools_1.18.2 RBGL_1.80.0 NMF_0.27
## [130] rappdirs_0.3.3 stringr_1.5.1 digest_0.6.35
## [133] shinyBS_0.61.1 rmarkdown_2.26 ca_0.71.1
## [136] XVector_0.44.0 htmltools_0.5.8.1 pkgconfig_2.0.3
## [139] base64enc_0.1-3 lpsymphony_1.32.0 highr_0.10
## [142] dbplyr_2.5.0 fastmap_1.1.1 rlang_1.1.3
## [145] GlobalOptions_0.1.2 htmlwidgets_1.6.4 UCSC.utils_1.0.0
## [148] shiny_1.8.1.1 farver_2.1.1 jquerylib_0.1.4
## [151] IHW_1.32.0 jsonlite_1.8.8 BiocParallel_1.38.0
## [154] GOSemSim_2.30.0 RCurl_1.98-1.14 magrittr_2.0.3
## [157] GenomeInfoDbData_1.2.12 patchwork_1.2.0 munsell_0.5.1
## [160] Rcpp_1.0.12 proto_1.0.0 shinycssloaders_1.0.0
## [163] viridis_0.6.5 sqldf_0.4-11 stringi_1.8.4
## [166] rintrojs_0.3.4 MASS_7.3-60.2 zlibbioc_1.50.0
## [169] plyr_1.8.9 parallel_4.4.0 ggrepel_0.9.5
## [172] Biostrings_2.72.0 splines_4.4.0 hash_2.2.6.3
## [175] hms_1.1.3 geneLenDataBase_1.39.0 circlize_0.4.16
## [178] locfit_1.5-9.9 igraph_2.0.3 rngtools_1.5.2
## [181] reshape2_1.4.4 biomaRt_2.60.0 XML_3.99-0.16.1
## [184] evaluate_0.23 BiocManager_1.30.23 tweenr_2.0.3
## [187] foreach_1.5.2 httpuv_1.6.15 backbone_2.1.3
## [190] polyclip_1.10-6 tidyr_1.3.1 purrr_1.0.2
## [193] heatmaply_1.5.0 clue_0.3-65 ggplot2_3.5.1
## [196] ggforce_0.4.2 gridBase_0.4-7 xtable_1.8-4
## [199] restfulr_0.0.15 later_1.3.2 viridisLite_0.4.2
## [202] tibble_3.2.1 memoise_2.0.1 registry_0.5-1
## [205] GenomicAlignments_1.40.0 cluster_2.1.6 shinyWidgets_0.8.6
## [208] shinyAce_0.4.2 GSEABase_1.66.0 BiocStyle_2.32.0